1. INTRODUCTION

PageRank [3] is a Web page ranking technique that has been 
a fundamental ingredient in the development and success of 
the Google search engine. The method is still one of the 
many signals that Google uses to determine which pages are 
most important.¹ The main idea behind PageRank is to
determine the importance of a Web page in terms of the 
importance assigned to the pages hyperlinking to it. In fact, 
this thesis is not new, and has been previously successfully 
exploited in different contexts. We review the PageRank 
method and link it to some renowned previous techniques 
that we have found in the fields of Web information retrieval, 
bibliometrics, sociometry, and econometrics. 

2. WEB INFORMATION RETRIEVAL 

In 1945 Vannevar Bush wrote a now-celebrated article in
The Atlantic Monthly entitled "As We May Think", describing
a futuristic device he called Memex [5]. Bush writes:

"Wholly new forms of encyclopedias will appear, 
ready made with a mesh of associative trails run- 
ning through them, ready to be dropped into the 
Memex and there amplified."

Bush's prediction came true in 1989, when Tim Berners-Lee 
proposed the Hypertext Markup Language (HTML) to keep 
track of experimental data at the European Organization 
for Nuclear Research (CERN). In the original far-sighted 
proposal in which Berners-Lee attempts to persuade CERN 
management to adopt the new global hypertext system we 
can read the following paragraph:²

1 http://www.google.com/corporate/tech.html
2 http://www.w3.org/History/1989/proposal.html



"We should work toward a universal linked infor- 
mation system, in which generality and portabil- 
ity are more important than fancy graphics tech- 
niques and complex extra facilities. The aim would 
be to allow a place to be found for any informa- 
tion or reference which one felt was important, 
and a way of finding it afterwards. The result 
should be sufficiently attractive to use that the 
information contained would grow past a critical 
threshold."



As we all know, the proposal was accepted and later imple- 
mented in a mesh - this was the only name that Berners-Lee 
originally used to describe the Web - of interconnected doc- 
uments that rapidly grew beyond the CERN threshold, as 
Berners-Lee anticipated, and became the World Wide Web. 

Today, the Web is a huge, dynamic, self-organized, and hy- 
perlinked data source, very different from traditional doc- 
ument collections which are nonlinked, mostly static, cen- 
trally collected and organized by specialists. These features 
make Web information retrieval quite different from tradi- 
tional information retrieval and call for new search abilities, 
like automatic crawling and indexing of the Web. Moreover, 
early search engines ranked responses using only a content 
score, which measures the similarity between the page and 
the query. One simple example is just a count of the num- 
ber of times the query words occur on the page, or perhaps 
a weighted count with more weight on title words. These 
traditional query-dependent techniques suffered under the 
gigantic size of the Web and the death grip of spammers. 

In 1998, Sergey Brin and Larry Page revolutionised the field 
of Web information retrieval by introducing the notion of an 
importance score, which gauges the status of a page, inde- 
pendently from the user query, by analysing the topology of 
the Web graph. The method was implemented in the famous 
PageRank algorithm and both the traditional content score 
and the new importance score were efficiently combined in 
a new search engine named Google. 

3. RANKING WEB PAGES USING PAGERANK

We briefly recall how the PageRank method works, keeping
the mathematical machinery to a minimum. Interested
readers can investigate the topic more thoroughly in a recent
book by Langville and Meyer, which elegantly describes the
science of search engine rankings in a rigorous yet playful
style [15].

We start by providing an intuitive interpretation of PageRank
in terms of random walks on graphs [21]. The Web is
viewed as a directed graph of pages connected by hyperlinks.
A random surfer starts from an arbitrary page and simply
keeps clicking on successive links at random, bouncing from
page to page. The PageRank value of a page corresponds
to the relative frequency with which the random surfer visits
that page, assuming that the surfer goes on infinitely. The more
time the random surfer spends on a page, the higher the
PageRank importance of the page.

A little more formally, the method can be described as follows.
Let us denote by $q_i$ the number of distinct outgoing
(hyper)links of page $i$. Let $H = (h_{i,j})$ be a square matrix
of size equal to the number $n$ of Web pages such that
$h_{i,j} = 1/q_i$ if there exists a link from page $i$ to page $j$ and
$h_{i,j} = 0$ otherwise. The value $h_{i,j}$ can be interpreted as
the probability that the random surfer moves from page $i$ to
page $j$ by clicking on one of the distinct links of page $i$. The
PageRank $\pi_j$ of page $j$ is recursively defined as:

$$\pi_j = \sum_i \pi_i h_{i,j}$$

or, in matrix notation, $\pi = \pi H$. Hence, the PageRank of
page $j$ is the sum of the PageRank scores of pages $i$ linking
to $j$, weighted by the probability of going from $i$ to $j$. In
words, the PageRank thesis reads as follows:

A Web page is important if it is pointed to by 
other important pages. 
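As a minimal sketch of the definitions above, the following Python fragment builds the matrix $H$ for a tiny, hypothetical three-page Web and applies the recursion $\pi_j = \sum_i \pi_i h_{i,j}$ once. The function names, the dictionary encoding of links, and the example graph are illustrative assumptions, not part of the original method description.

```python
def hyperlink_matrix(links, n):
    """Build H with H[i][j] = 1/q_i if page i links to page j, else 0."""
    H = [[0.0] * n for _ in range(n)]
    for i, targets in links.items():
        distinct = set(targets)              # q_i counts distinct outgoing links
        for j in distinct:
            H[i][j] = 1.0 / len(distinct)
    return H


def step(pi, H):
    """Apply pi_j = sum_i pi_i * h_ij once, that is, compute pi H."""
    n = len(H)
    return [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]


# Hypothetical Web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
H = hyperlink_matrix(links, 3)

# For this particular graph, (0.4, 0.2, 0.4) satisfies pi = pi H.
pi = [0.4, 0.2, 0.4]
print(step(pi, H))
```

Page 2 collects the full endorsement of page 1 and half of that of page 0, which is why it ends up as important as page 0 in this toy example.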

There are in fact three distinct factors that determine the 
PageRank of a page: (i) the number of links it receives, (ii) 
the link propensity, that is, the number of outgoing links, 
of the linking pages, and (iii) the PageRank of the linking 
pages. The first factor is not surprising: the more links a
page receives, the more important it is perceived to be. Reasonably,
the link value depreciates proportionally to the number
of links given out by a page: endorsements coming from
parsimonious pages are worthier than those issued by
spendthrift ones. Finally, not all pages are created equal:
links from important pages are more valuable than those 
from obscure ones. 

Figure 1: A PageRank instance with solution. Each
node is labelled with its PageRank score. Scores
have been normalized to sum to 100. We assumed
$\alpha = 0.85$.

Unfortunately, this ideal model has two problems that prevent
the solution of the system. The first one is due to the
presence of dangling nodes, which are pages with no forward
links.³ These pages capture the random surfer indefinitely.
Notice that a dangling node corresponds to a row in matrix
$H$ with all entries equal to 0. To tackle the problem of
dangling nodes, the corresponding rows in $H$ are replaced
by the uniform probability vector $u = (1/n)e$, where $e$ is a
vector of length $n$ with all components equal to 1. Alternatively,
one may use any fixed probability vector in place
of $u$. This means that the random surfer escapes from the
dangling page by jumping to a randomly chosen page. We
call $S$ the resulting matrix.

3 The term dangling refers to the fact that many dangling
nodes are in fact pendent Web pages found by the crawling
spiders but whose links have not yet been explored.
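The dangling-node patch can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `patch_dangling` and the three-page example matrix are hypothetical, and matrices are stored as plain lists of rows.

```python
def patch_dangling(H):
    """Return S: every all-zero row of H is replaced by the uniform vector u."""
    n = len(H)
    u = [1.0 / n] * n
    return [row[:] if any(row) else u[:] for row in H]


# Hypothetical example in which page 1 has no outgoing links (a dangling node).
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
S = patch_dangling(H)
for row in S:
    print(row)
```

After the patch every row of the matrix sums to 1, so each row can be read as a probability distribution over the next page visited.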

The second problem with the ideal model is that the surfer
can get trapped in a bucket of the Web graph, which is a
reachable strongly connected component without outgoing
edges towards the rest of the graph. The solution proposed
by Brin and Page is to replace matrix $S$ by the Google matrix

$$G = \alpha S + (1 - \alpha) E$$

where $E$ is the teleportation matrix with identical rows each
equal to the uniform probability vector $u$, and $\alpha$ is a free
parameter of the algorithm often called the damping factor.
Alternatively, a fixed personalization probability vector $v$
can be used in place of $u$. In particular, the personalization
vector can be exploited to bias the result of the method
towards certain topics. The interpretation of the new system
is that, with probability $\alpha$, the random surfer moves forward
by following links, and, with the complementary probability
$1 - \alpha$, the surfer gets bored of following links and enters a new
destination in the browser's URL line, possibly unrelated
to the current page. The surfer is hence teleported, like a
Star Trek character, to that page, even if there exists no
link connecting the current and the destination pages in the
Web universe. The inventors of PageRank propose to set
the damping factor $\alpha = 0.85$, meaning that after about five
link clicks the random surfer chooses a random page.
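The construction of the Google matrix can be sketched as follows. The function name `google_matrix`, its optional personalization argument `v`, and the example matrix are illustrative assumptions; a dense matrix is used only because the example is tiny.

```python
def google_matrix(S, alpha=0.85, v=None):
    """Return G = alpha * S + (1 - alpha) * E, where every row of E equals v.

    If v is omitted, the uniform probability vector u is used.
    """
    n = len(S)
    if v is None:
        v = [1.0 / n] * n
    return [[alpha * S[i][j] + (1 - alpha) * v[j] for j in range(n)]
            for i in range(n)]


# Hypothetical stochastic matrix S (row 1 is an already patched dangling node).
S = [[0.0, 0.5, 0.5],
     [1.0 / 3, 1.0 / 3, 1.0 / 3],
     [1.0, 0.0, 0.0]]
G = google_matrix(S)
for row in G:
    print(row)
```

Every entry of `G` is strictly positive when the teleportation vector is uniform, so the surfer can reach any page from any other page in one step.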

The PageRank vector is then defined as the solution of the
equation:

$$\pi = \pi G \qquad (1)$$

An example is provided in Figure 1. Node A is a dangling
node, while nodes B and C form a bucket. Notice the dynamics
of the method: page C receives just one link but from
the most important page B; its importance is much higher
than that of page E, which receives many more links, but
from anonymous pages. Pages G, H, I, L, and M do not receive
endorsements; their scores correspond to the minimum
amount of status of each page.

Typically, the normalization condition $\sum_i \pi_i = 1$ is also
added. In this case Equation (1) becomes $\pi = \alpha \pi S + (1 - \alpha) u$.
The latter distinguishes two factors contributing to the
PageRank vector: an endogenous factor equal to $\pi S$, which
takes into consideration the real topology of the Web graph,
and an exogenous factor equal to the uniform probability
vector $u$, which can be interpreted as a minimal amount of
status assigned to each page independently of the hyperlink
graph. The parameter $\alpha$ balances between these two factors.
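The step from Equation (1) to the normalized form can be spelled out as a short derivation. It uses only the definitions given above: $G = \alpha S + (1 - \alpha) E$, the fact that every row of $E$ equals $u$, and the normalization $\sum_i \pi_i = 1$.

```latex
\begin{align*}
\pi &= \pi G = \alpha\, \pi S + (1 - \alpha)\, \pi E, \\
(\pi E)_j &= \sum_i \pi_i\, u_j = u_j \sum_i \pi_i = u_j
  \quad\Longrightarrow\quad \pi E = u, \\
\pi &= \alpha\, \pi S + (1 - \alpha)\, u .
\end{align*}
```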

4. COMPUTING THE PAGERANK VECTOR 

Does Equation (1) have a solution? Is the solution unique?
Can we efficiently compute it? The success of the PageRank 
method rests on the answers to these queries. Luckily, all 
these questions have nice answers. 

Thanks to the dangling nodes patch, matrix $S$ is a stochastic
matrix⁴, and clearly the teleportation matrix $E$ is also
stochastic. It follows that $G$ is stochastic as well, since it is
defined as a convex combination of the stochastic matrices $S$ and
$E$. It is easy to show that, if $G$ is stochastic, Equation (1)
always has at least one solution. Hence, we have at least one
PageRank vector. Having two independent PageRank vectors,
however, would already be too much: which one should
we use to rank Web pages? Here, a fundamental result of
algebra comes to the rescue: the Perron-Frobenius theorem [23, 6].
It states that, if $A$ is an irreducible⁵ nonnegative square
matrix, then there exists a unique vector $x$, called the Perron
vector, such that $xA = rx$, $x > 0$, and $\sum_i x_i = 1$, where $r$ is
the maximum eigenvalue of $A$ in absolute value, which algebraists
call the spectral radius of $A$. The Perron vector is the
left dominant eigenvector of $A$, that is, the left eigenvector
associated with the largest eigenvalue in magnitude.

The matrix $S$ is most likely reducible, since experiments
have shown that the Web has a bow-tie structure fragmented
into four main continents that are not mutually reachable, as
first observed in [3]. Thanks to the teleportation trick, however,
the graph of matrix $G$ is strongly connected. Hence $G$
is irreducible and the Perron-Frobenius theorem applies.⁶ Therefore,
a positive PageRank vector exists and is furthermore
unique.

Interestingly, we can arrive at the same result using Markov
theory [19]. The random walk on the Web described above,
graph, modified with the teleportation jumps, naturally in- 
duces a finite-state Markov chain, whose transition matrix 
is the stochastic matrix G. Since G is irreducible, the chain 
has a unique stationary distribution corresponding to the 
PageRank vector. 

4 This simply means that all rows sum up to 1. 

5 A matrix is irreducible if and only if the directed graph 
associated with it is strongly connected, that is, for every 
pair i and j of graph nodes there are paths leading from i 
to j and from j to i. 

6 Since G is stochastic, its spectral radius is 1. 



Year  Author                           Contribution
1906  Markov                           Markov theory [19]
1907  Perron                           Perron theorem [23]
1912  Frobenius                        Perron-Frobenius theorem [6]
1929  von Mises & Pollaczek-Geiringer  Power method [30]
1941  Leontief                         Econometric model [17]
1949  Seeley                           Sociometric model [28]
1952  Wei                              Sport ranking model [31]
1953  Katz                             Sociometric model [10]
1965  Hubbell                          Sociometric model [9]
1976  Pinski & Narin                   Bibliometric model [25]
1998  Kleinberg                        HITS [13]
1998  Brin & Page                      PageRank [3]

Table 1: PageRank history.

A last crucial question remains: can we efficiently compute
the PageRank vector? The success of PageRank is largely
due to the existence of a fast method to compute its values:
the power method, a simple iteration method to find
the dominant eigenpair of a matrix developed by von Mises
and Pollaczek-Geiringer [30]. It works as follows on the
Google matrix $G$. Let $\pi^{(0)} = u = (1/n)e$. Repeatedly compute
$\pi^{(k+1)} = \pi^{(k)} G$ until $\|\pi^{(k+1)} - \pi^{(k)}\| < \epsilon$, where $\|\cdot\|$
measures the distance between the two successive PageRank
vectors and $\epsilon$ is the desired precision.
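The iteration just described can be sketched in Python as follows. Several details are assumptions made for illustration: the distance is measured with the 1-norm (the text leaves the norm unspecified), the iteration cap `max_iter` is a safety guard, and the three-page matrix is a hypothetical example.

```python
def power_method(G, eps=1e-10, max_iter=1000):
    """Iterate pi <- pi G from the uniform vector until successive
    vectors differ by less than eps in the 1-norm (an assumed choice)."""
    n = len(G)
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        nxt = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        dist = sum(abs(a - b) for a, b in zip(nxt, pi))
        pi = nxt
        if dist < eps:
            return pi, k
    return pi, max_iter


# Hypothetical 3-page example: G = alpha * S + (1 - alpha) * E with uniform E.
alpha, n = 0.85, 3
S = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
G = [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

pi, iterations = power_method(G)
print(pi, iterations)
```

Because $G$ is stochastic, each iterate keeps summing to 1 (up to rounding), so no explicit renormalization is needed inside the loop.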

The convergence rate of the power method is approximately
the rate at which $\alpha^k$ approaches 0: the closer $\alpha$ is to unity,
the lower the convergence speed of the power method. If, for
instance, $\alpha = 0.85$, 43 iterations are sufficient to
gain 3 digits of accuracy, and 142 iterations are enough for
10 digits of accuracy. Notice that the power method applied
to matrix $G$ can be easily expressed in terms of matrix $H$,
which, unlike $G$, is a very sparse matrix that can be stored
using a linear amount of memory with respect to the size of
the Web.
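The iteration counts quoted above follow from asking for the smallest $k$ with $\alpha^k \le 10^{-d}$, where $d$ is the number of digits wanted. A small sketch, in which the function name `iterations_needed` is a hypothetical helper:

```python
import math


def iterations_needed(alpha, digits):
    """Smallest k such that alpha**k <= 10**(-digits)."""
    return math.ceil(-digits * math.log(10) / math.log(alpha))


print(iterations_needed(0.85, 3))    # 43
print(iterations_needed(0.85, 10))   # 142
```

This is only the approximate rate discussed in the text; the number of iterations observed on a real graph also depends on the starting vector and on the norm used in the stopping test.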

5. STANDING ON THE SHOULDERS OF GIANTS

Dwarfs standing on the shoulders of giants is a Western 
metaphor meaning "One who develops future intellectual 
pursuits by understanding the research and works created 
by notable thinkers of the past".⁷ The metaphor was famously
uttered by Isaac Newton: "If I have seen a little
further it is by standing on the shoulders of Giants". Moreover,
"Stand on the shoulders of giants" is Google Scholar's
over, "Stand on the shoulders of giants" is Google Scholar's 
motto: "the phrase is our acknowledgement that much of 
scholarly research involves building on what others have al- 
ready discovered". 

7 From the Wikipedia page for Standing on the shoulders of
giants.

There are many giants upon whose shoulders PageRank firmly
stands: Markov [19], Perron [23], Frobenius [6], and von Mises
and Pollaczek-Geiringer [30] provided at the beginning of the
1900s the necessary mathematical machinery to investigate
and effectively solve the PageRank problem. Moreover, the
circular PageRank thesis has been previously exploited in
different contexts, including Web information retrieval, bibliometrics,
sociometry, and econometrics. In the following,
we review these contributions and link them to the PageRank
method. Table 1 contains a brief summary of PageRank
history. All the ranking techniques surveyed in this
paper have been implemented in R [26] and the code is freely
available at the author's Web page.

Figure 2: A HITS instance with solution (compare
with PageRank scores in Figure 1). Each node is
labelled with its authority (top) and hub (bottom)
scores. Scores have been normalized to sum to 100.
The dominant eigenvalue for both authority and hub
matrices is 10.7.

5.1 Hubs and authorities on the Web 

Hypertext Induced Topic Search (HITS) is a Web page ranking
method proposed by Jon Kleinberg [13, 14]. The connections
between HITS and PageRank are striking. Despite the
close conceptual, temporal and even geographical proximity
of the two approaches, it appears that HITS and PageRank
have been developed independently. Both papers,
presenting PageRank [3] and HITS [14], are today citational
blockbusters: the PageRank article collected 6167 citations,
while the HITS paper has been cited 4617 times.⁸

HITS thinks of Web pages as authorities and hubs. The HITS
circular thesis reads as follows:

Good authorities are pages that are pointed to by 
good hubs and good hubs are pages that point to 
good authorities. 

Let $L = (l_{i,j})$ be the adjacency matrix of the Web graph,
i.e., $l_{i,j} = 1$ if page $i$ links to page $j$ and $l_{i,j} = 0$ otherwise.
We denote with $L^T$ the transpose of $L$. HITS defines a pair
of recursive equations as follows, where $x$ is the authority
vector containing the authority scores and $y$ is the hub vector
containing the hub scores:

$$x^{(k)} = L^T y^{(k-1)}, \qquad y^{(k)} = L x^{(k)} \qquad (2)$$

where $k \geq 1$ and $y^{(0)} = e$, the vector of all ones. The first
equation tells us that authoritative pages are those pointed
to by good hub pages, while the second equation claims that
good hubs are pages that point to authoritative pages. Notice
that Equation (2) is equivalent to:

$$x^{(k)} = L^T L \, x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)} \qquad (3)$$

It follows that the authority vector $x$ is the dominant right
eigenvector of the authority matrix $A = L^T L$, and the hub
vector $y$ is the dominant right eigenvector of the hub matrix
$H = L L^T$. This is very similar to the PageRank method,
except for the use of the authority and hub matrices instead of
the Google matrix.

To compute the dominant eigenpair (eigenvector and eigenvalue)
of the authority matrix we can again exploit the
power method as follows: let $x^{(0)} = e$. Repeatedly compute
$z^{(k)} = A x^{(k-1)}$ and normalize $x^{(k)} = z^{(k)} / m(z^{(k)})$,
where $m(z^{(k)})$ is the signed component of maximal magnitude
of $z^{(k)}$, until the desired precision is achieved. It follows that
$x^{(k)}$ converges to the dominant eigenvector $x$ (the authority
vector) and $m(z^{(k)})$ converges to the dominant eigenvalue
(the spectral radius, which is not necessarily 1). The hub
vector $y$ is then given by $y = Lx$. While the convergence of
the power method is guaranteed, the computed solution is
not necessarily unique, since the authority and hub matrices
are not necessarily irreducible. A modification similar to
the teleportation trick used for the PageRank method can
be applied to HITS to recover the uniqueness of the solution [33].
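The power method for HITS described above can be sketched as follows. The function name `hits`, the stopping rule (largest componentwise change below `eps`), the guard for a graph without links, and the four-page example are all assumptions made for illustration.

```python
def hits(L, eps=1e-10, max_iter=1000):
    """Return (authority vector x, hub vector y, dominant eigenvalue estimate)."""
    n = len(L)
    # Authority matrix A = L^T L.
    A = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    x = [1.0] * n                       # x^(0) = e
    m = 0.0
    for _ in range(max_iter):
        z = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        m = max(z, key=abs)             # signed component of maximal magnitude
        if m == 0:                      # no links at all: nothing to rank
            break
        new_x = [zi / m for zi in z]
        converged = max(abs(a - b) for a, b in zip(new_x, x)) < eps
        x = new_x
        if converged:
            break
    y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]   # y = L x
    return x, y, m


# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
L = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
x, y, m = hits(L)
print(x, y, m)
```

In this toy graph pages 2 and 3 are the only authorities and pages 0 and 1 the only hubs; page 0 is the better hub because it points to both authorities.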

An example of HITS is given in Figure 2. We stress the
difference between importance, as computed by PageRank,
and authority and hubness, as computed by HITS. Page B
is both important and authoritative, but it is not a good
hub. Page C is important but by no means authoritative.
Pages G, H, I are neither important nor authoritative, but
they are the best hubs of the network, since they point to
good authorities only. Notice that the hub score of B is 0
although B has one outgoing edge; unfortunately for B, the
only page C linked by B has no authority. Similarly, C has
no authority because it is pointed to only by B, whose hub
score is zero. This shows the difference between indegree
and authority, as well as between outdegree and hubness.
Finally, we observe that nodes with null authority scores
(respectively, null hub scores) correspond to isolated nodes
in the graph whose adjacency matrix is the authority matrix
$A$ (respectively, the hub matrix $H$).



8 Source: Google Scholar on February 5, 2010.



An advantage of HITS with respect to PageRank is that it
provides two scores at the price of one. The user is hence
provided with two rankings: the most authoritative pages
about the research topic, which can be exploited to investigate
in depth a research subject, and the most hubby pages,
which correspond to portal pages linking to the research
topic from which a broad search can be started. A disadvantage
of HITS is the higher susceptibility of the method to
spamming: while it is difficult to add incoming links to our
favourite page, the addition of outgoing links is much easier.
This leads to the possibility of purposely inflating the
hub score of a page, indirectly influencing also the authority
scores of the pointed pages.

A later algorithm that incorporates ideas from both
PageRank and HITS is SALSA [16]: like HITS, SALSA computes
both authority and hub scores, and like PageRank,
these scores are obtained from Markov chains.

5.2 Bibliometrics 

Bibliometrics, also known as scientometrics, is the quantitative
study of the process of scholarly publication of research
achievements. The most mundane aspect of this branch of
information and library science is the design and application
of bibliometric indicators to determine the influence of
bibliometric units like scholars and academic journals. The
Impact Factor is, undoubtedly, the most popular and controversial
journal bibliometric indicator available at the moment.
It is defined, for a given journal and a fixed year,
as the mean number of citations in the year to papers published
in the two previous years. It was proposed in
1963 by Eugene Garfield, the founder of the Institute for Scientific
Information (ISI), working together with Irv Sher [7].
Journal Impact Factors are currently published in the popular
Journal Citation Reports by Thomson-Reuters, the new
owner of the ISI.

The Impact Factor does not take into account the importance
of the citing journals: citations from highly reputed
journals are weighted the same as those from obscure journals. In
1976 Gabriel Pinski and Francis Narin developed an innovative
journal ranking method [25]. The method measures the
influence of a journal in terms of the influence of the citing
journals. The Pinski and Narin thesis is:

A journal is influential if it is cited by other in- 
fluential journals. 

This is the same circular thesis as that of the PageRank method.
Given a source time window $T_1$ and a previous target time
window $T_2$, the journal citation system can be viewed as
a weighted directed graph in which nodes are journals and
there is an edge from journal $i$ to journal $j$ if there is some
article published in $i$ during $T_1$ that cites an article published
in $j$ during $T_2$. The edge is weighted with the number $c_{i,j}$
of such citations from $i$ to $j$. Let $c_i = \sum_j c_{i,j}$ be the total
number of cited references of journal $i$.

In the method described by Pinski and Narin, a citation matrix
$H = (h_{i,j})$ is constructed such that $h_{i,j} = c_{i,j}/c_j$. The
coefficient $h_{i,j}$ is the amount of citations received by journal
$j$ from journal $i$ per reference given out by journal $j$. For
each journal an influence score is determined which measures
the relative journal performance per given reference.
The influence score $\pi_j$ of journal $j$ is defined as:

$$\pi_j = \sum_i \pi_i \frac{c_{i,j}}{c_j} = \sum_i \pi_i h_{i,j}$$

or, in matrix notation:

$$\pi = \pi H \qquad (4)$$

Figure 3: An instance with solution of the journal
ranking method proposed by Pinski and Narin.
Nodes are labelled with influence scores and edges
with the citation flow between journals. Scores have
been normalized to sum to 100.

Hence, journals $j$ with a large total influence $\pi_j c_j$ are those
that receive significant endorsements from influential journals.
Notice that the influence per reference score $\pi_j$ of a
journal $j$ is a size-independent measure, since the formula
normalizes by the number of cited references $c_j$ contained in
articles of the journal, which is an estimation of the size of
the journal. Moreover, the normalization neutralizes the effect
of journal self-citations, which are citations between articles
in the same journal. These citations are indeed counted
both at the numerator and at the denominator of the influence
score formula. This avoids over-inflating journals that
engage in the practice of opportunistic self-citations.

It can be proved that the spectral radius of matrix H is 
1, hence the influence score vector corresponds to the domi- 
nant eigenvector of H [8]. In principle, the uniqueness of the 
solution and the convergence of the power method to it are 
not guaranteed. Nevertheless, both properties are not diffi- 
cult to obtain in real cases. If the citation graph is strongly 
connected, then the solution is unique. When journals be- 
long to the same research field, this condition is typically 
satisfied. Moreover, if there exists a self-loop in the graph, 
that is an article that cites an article in the same journal, 
then the power method converges. 
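A minimal sketch of the Pinski and Narin computation, assuming a tiny, hypothetical two-journal citation matrix in which every journal gives out at least one reference. The function name `influence_scores`, the rescaling of scores to sum to 1 at each step, and the stopping rule are illustrative choices, not part of the original method description.

```python
def influence_scores(C, eps=1e-12, max_iter=10000):
    """Power method for pi = pi H with h_ij = c_ij / c_j.

    C[i][j] is the number of citations from journal i to journal j.
    """
    n = len(C)
    c = [sum(row) for row in C]          # c_i: references given out by journal i
    H = [[C[i][j] / c[j] for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        nxt = [p / total for p in nxt]   # rescale so that scores sum to 1
        if max(abs(a - b) for a, b in zip(nxt, pi)) < eps:
            return nxt
        pi = nxt
    return pi


# Hypothetical data: journal 0 cites itself twice and journal 1 twice;
# journal 1 cites journal 0 once. The graph is strongly connected and
# has a self-loop, so the solution is unique and the iteration converges.
C = [[2, 2],
     [1, 0]]
pi = influence_scores(C)
print(pi)
```

Journal 1 gives out a single reference yet receives two citations, so its influence per reference is twice that of journal 0 in this example.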

Figure 3 provides an example of the Pinski and Narin method.
Notice that the graph is strongly connected and has a self-loop,
hence the solution is unique and can be computed with
the power method. Both journals A and C receive the same
number of citations and give out the same number of references.
Nevertheless, the influence of A is bigger, since it is
cited by a more influential journal (B instead of D). Furthermore,
A and D receive the same number of citations from
the same journals, but D is larger than A, since it contains
more references, hence the influence of A is higher.

Similar recursive methods have been independently proposed
by [18] and [22] in the context of ranking of economics journals.
Recently, various PageRank-inspired bibliometric indicators
to evaluate the importance of journals using the
academic citation network have been proposed and extensively
tested: journal PageRank [2], Eigenfactor [33], and
SCImago [27].

5.3 Sociometry 

Sociometry, the quantitative study of social relationships, 
contains remarkably old PageRank predecessors. Sociolo- 
gists were the first to use the network approach to investi- 
gate the properties of groups of people related in some way. 
They devised measures like indegree, closeness, betweenness,
as well as eigenvector centrality, which are still used today
in modern (not necessarily social) network analysis [20]. In
particular, eigenvector centrality uses the same central in- 
gredient of PageRank applied to a social network: 



A person is prestigious if he is endorsed by pres- 
tigious people. 



John R. Seeley in 1949 is probably the first in this context
to use the circular argument of PageRank [28]. Seeley reasons
in terms of social relationships among children: each
child chooses other children in a social group with a non-negative
strength. The author notices that the total choice
strength received by each child is inadequate as an index
of popularity, since it does not consider the popularity
of the chooser. Hence, he proposes to define the popularity
of a child as a function of the popularity of those children
who chose the child, and the popularity of the choosers as a
function of the popularity of those who chose them, and so
on in an "indefinitely repeated reflection". Seeley formulates the
problem in terms of linear equations and uses Cramer's rule
to solve the linear system. He does not discuss the issue of
uniqueness.

Another model was proposed in 1953 by Leo Katz [10]. Katz
views a social network as a directed graph where nodes are
people and person $i$ is connected by an edge to person $j$ if
$i$ chooses, or endorses, $j$. The status of member $j$ is defined
as the number of weighted paths reaching $j$ in the network,
a generalization of the indegree measure. Long paths are
weighted less than short ones, since endorsements devalue
over long chains. Notice that this method indirectly takes
account of who endorses as well as how many endorse an
individual: if a node $i$ points to a node $j$ and $i$ is reached
by many paths, then the paths leading to $i$ arrive also at $j$
in one additional step.

Katz builds an adjacency matrix $L = (l_{i,j})$ such that $l_{i,j} = 1$
if person $i$ chooses person $j$ and $l_{i,j} = 0$ otherwise. He defines
a matrix $W = \sum_{k=1}^{\infty} (\alpha L)^k$, where $\alpha$ is an attenuation
constant. Notice that the $(i,j)$ component of $L^k$ is the number
of paths of length $k$ from $i$ to $j$, and this number is attenuated
by $\alpha^k$ in the computation of $W$. Hence, the $(i,j)$
component of the limit matrix $W$ is the weighted number of
arbitrary paths from $i$ to $j$. Finally, the status of member $j$
is $\pi_j = \sum_i w_{i,j}$, that is, the number of weighted paths reaching
$j$. If the attenuation factor $\alpha < 1/\rho(L)$, with $\rho(L)$ the
spectral radius of $L$, then the above series for $W$ converges.
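The Katz status vector can be approximated by summing the series term by term, keeping only the row vector $e (\alpha L)^k$ at each step instead of the full matrix power. This is a sketch under stated assumptions: the function name `katz_status`, the fixed number of terms, and the three-person chain are hypothetical, and the truncation is only meaningful when $\alpha < 1/\rho(L)$.

```python
def katz_status(L, alpha, terms=200):
    """Approximate pi = e * sum_{k>=1} (alpha L)^k by truncating the series."""
    n = len(L)
    pi = [0.0] * n
    row = [1.0] * n                      # e (alpha L)^0 = e
    for _ in range(terms):
        # Multiply the current row vector by alpha L to get e (alpha L)^k.
        row = [alpha * sum(row[i] * L[i][j] for i in range(n))
               for j in range(n)]
        pi = [p + r for p, r in zip(pi, row)]
    return pi


# Hypothetical chain of endorsements: person 0 chooses 1, person 1 chooses 2.
L = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(katz_status(L, 0.5))   # [0.0, 0.5, 0.75]
```

Person 2 is reached by one path of length 1 (weight 0.5) and one path of length 2 (weight 0.25), which gives the status 0.75.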




Figure 4: An example of the Katz model using two
attenuation factors: $\alpha = 0.9$ and $\alpha = 0.1$ (the spectral
radius of the adjacency matrix $L$ is 1). Each node
is labelled with the Katz score corresponding to $\alpha = 0.9$
(top) and $\alpha = 0.1$ (bottom). Scores have been
normalized to sum to 100.



Figure 4 illustrates the method with an example. Notice the
important role of the attenuation factor: when it is large
(close to $1/\rho(L)$), long paths are devalued smoothly, and
Katz scores are strongly correlated with PageRank ones. In
the shown example, PageRank and Katz methods provide
the same ranking of nodes when the attenuation factor is
0.9. On the other hand, if the attenuation factor is small
(close to 0), then the contribution given by paths longer
than 1 rapidly declines, and thus Katz scores converge to
indegrees, the number of incoming links of nodes. In the
example, when the attenuation factor drops to 0.1, nodes C
and E switch their positions in the ranking: node E, which
receives many short paths, significantly increases its score,
while node C, which is the destination of just one short path
and many (devalued) long ones, significantly decreases its
score.

In 1965 Charles H. Hubbell generalized the proposal of Katz [9].
Given a set of members of a social context, Hubbell defines
a matrix $W = (w_{i,j})$ such that $w_{i,j}$ is the strength at which
$i$ endorses $j$. Interestingly, these weights can be arbitrary,
and in particular, they can be negative. The prestige of a
member is recursively defined in terms of the prestige of the
endorsers and takes account of the endorsement strengths:

$$\pi = \pi W + v \qquad (5)$$




Figure 5: An instance of the Hubbell model with 
solution: each node is labelled with its prestige 
score and each edge is labelled with the endorsement 
strength between the connected members; negative 
strength is highlighted with dashed edges. The min- 
imal amount of status has been fixed to 0.2 for all 
members. 



The term $v$ is an exogenous vector such that $v_i$ is a minimal
amount of status assigned to $i$ from outside the system.

The original aspects of the method are the presence of an 
exogenous initial input and the possibility of giving nega- 
tive endorsements. A consequence of negative endorsements 
is that the status of an actor can also be negative. An ac- 
tor that receives a positive (respectively, negative) judgment 
from a member of positive status increases (respectively, de- 
creases) his prestige. On the other hand, and interestingly, 
receiving a positive judgment from a member of negative 
status makes a negative contribution to the prestige of the 
endorsed member (if you are endorsed by some person affiliated
with the Mafia your reputation might indeed drop).
Moreover, receiving a negative endorsement from a member
of negative status makes a positive contribution to the
prestige of the endorsed person (if the same Mafioso opposes
you, then your reputation might rise).

Figure 5 shows an example for the Hubbell model. Notice
that Charles does not receive any endorsement and hence 
has the minimal amount of status given by default to each 
member. David receives only negative judgments; interest- 
ingly, the fact that he has a positive self opinion further 
decreases his status. A better strategy for him, knowing in 
advance of his negative status, would be to negatively judge 
himself, acknowledging the negative judgment given by the 
other members. 

Equation (5) is equivalent to $\pi (I - W) = v$, where $I$ is the
identity matrix, that is, $\pi = v (I - W)^{-1} = v \sum_{i=0}^{\infty} W^i$. The
series converges if and only if the spectral radius of $W$ is less
than 1. It is now clear that the Hubbell model is a generalization
of the Katz model to general matrices that adds an
initial exogenous input $v$. Indeed, the Katz equation for social
status is $\pi = e \sum_{i=1}^{\infty} (\alpha L)^i$, where $e$ is a vector of all ones. In
an unpublished note Vigna traces the history of the mathe- 
matics of spectral ranking and shows that there is a reduc- 
tion from the path summation formulation of Hubbell-Katz 
to the eigenvector formulation with teleportation of Page- 
Rank and vice versa [29]. In the mapping the attenuation 
constant is the counterpart of the PageRank damping fac- 
tor, and the exogenous vector corresponds to the PageRank 
personalization vector. The interpretation of PageRank as 
a sum of weighted paths is also investigated in pQ. 
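
The path-summation solution can be sketched numerically. The following is a minimal illustration, not Hubbell's own procedure: it iterates pi <- pi W + v starting from pi = v, so that the k-th iterate equals the partial sum v (I + W + ... + W^k). The three-member network, the function name `hubbell_status`, and the helper `network` are hypothetical and chosen only to reproduce the qualitative effects discussed above; the minimal status of 0.2 follows the figure caption.

```python
def hubbell_status(W, v, tol=1e-12, max_iter=10_000):
    """Approximate the solution of pi = pi W + v by fixed-point iteration.

    W[i][j] is the (possibly negative) endorsement strength from member i
    to member j; v[j] is the exogenous minimal status of member j.
    Starting from pi = v, the k-th iterate is v * (I + W + ... + W^k),
    which converges when the spectral radius of W is below 1.
    """
    n = len(v)
    pi = list(v)
    for _ in range(max_iter):
        nxt = [v[j] + sum(pi[i] * W[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise ValueError("no convergence: is the spectral radius of W below 1?")


# Hypothetical network (an assumption for illustration, not the figure's data).
V = [0.2, 0.2, 0.2]


def network(self_opinion):
    return [
        [0.0, 0.5, 0.0],           # member 0 endorses member 1
        [0.0, 0.0, -1.0],          # member 1 judges member 2 negatively
        [0.0, 0.0, self_opinion],  # member 2 judges himself
    ]


proud = hubbell_status(network(0.5), V)    # positive self opinion
humble = hubbell_status(network(-0.5), V)  # negative self opinion
print(proud)
print(humble)
```

In this toy network member 0 receives no endorsement and keeps the minimal status 0.2, member 1 rises to about 0.3, and member 2 ends with a negative status that is lower with a positive self opinion (about -0.2) than with a negative one (about -0.067).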



Spectral ranking methods have also been exploited to rank
sports teams in competitions that involve teams playing in
pairs [31, 12]. The underlying idea is that a team is strong
if it won against other strong teams. Much of the art of the
sports ranking problem is how to define the matrix entries
$a_{i,j}$ expressing how much team $j$ is better than team $i$ (e.g.,
we could pick $a_{i,j}$ to be 1 if $j$ beats $i$, 0.5 if the game ended
in a tie, and 0 otherwise) [11].
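
As a rough sketch of this idea, the code below builds the matrix with the example scoring just mentioned (1 if j beats i, 0.5 for a tie, 0 otherwise) and approximates its dominant left eigenvector by power iteration. The four-team schedule, the function names, and the fixed iteration count are assumptions made for illustration; the method converges here because the hypothetical results form a strongly connected, aperiodic graph, which is not guaranteed for an arbitrary season.

```python
def build_matrix(n, wins, ties=()):
    """Entry (i, j) is 1 if team j beat team i, 0.5 for a tie, 0 otherwise."""
    m = [[0.0] * n for _ in range(n)]
    for winner, loser in wins:
        m[loser][winner] += 1.0
    for a, b in ties:
        m[a][b] += 0.5
        m[b][a] += 0.5
    return m


def left_multiply(pi, m):
    n = len(m)
    return [sum(pi[i] * m[i][j] for i in range(n)) for j in range(n)]


def rate_teams(m, iters=500):
    """Power iteration for the dominant left eigenvector, normalized to sum 1."""
    n = len(m)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = left_multiply(pi, m)
        total = sum(nxt)
        if total == 0:
            raise ValueError("ratings vanished: the results graph has no cycles")
        pi = [x / total for x in nxt]
    return pi


# Hypothetical results: (winner, loser) pairs plus one tie.
WINS = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]
TIES = [(1, 3)]

M = build_matrix(4, WINS, TIES)
ratings = rate_teams(M)
print(ratings)
```

Each team's rating is proportional to the sum of the ratings of the teams it defeated (with half weight for ties), which is the circular definition of strength described above.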

5.4 Econometrics 

We conclude with a succinct description of the input-output 
model developed in 1941 by Nobel Prize winner Wassily W. 
Leontief in the field of econometrics - the quantitative study 
of economic principles [17]. According to the Leontief input-output
model, the economy of a country may be divided into
any desired number of sectors, called industries, each con- 
sisting of firms producing a similar product. Each industry 
requires certain inputs in order to produce a unit of its own 
product, and sells its products to other industries to meet 
their ingredient requirements. The aim is to find prices for 
the unit of product produced by each industry that guar- 
antee the reproducibility of the economy, which holds when 
each sector balances the costs for its inputs with the rev- 
enues of its outputs. In 1973, Leontief earned the Nobel 
Prize in economics for his work on the input-output model. 
An example is provided in Table 2.

Let $q_{i,j}$ denote the quantity produced by the $i$th industry
and used by the $j$th industry, and $q_i$ be the total quantity
produced by sector $i$, that is, $q_i = \sum_j q_{i,j}$. Let $A = (a_{i,j})$
be such that $a_{i,j} = q_{i,j}/q_j$; each coefficient $a_{i,j}$ represents
the amount of product (produced by industry) $i$ consumed
by industry $j$ that is necessary to produce a unit of product
$j$. Let $\pi_j$ be the price for the unit of product produced by
each industry $j$. The reproducibility of the economy holds
when each sector $j$ balances the costs for its inputs with the
revenues of its outputs, that is:

$$\mathrm{cost}_j = \sum_i \pi_i q_{i,j} = \mathrm{revenue}_j = \sum_i \pi_j q_{j,i} = \pi_j \sum_i q_{j,i} = \pi_j q_j$$

By dividing each balance equation by $q_j$ we have

$$\pi_j = \sum_i \pi_i \frac{q_{i,j}}{q_j} = \sum_i \pi_i a_{i,j}$$

or, in matrix notation,

$$\pi = \pi A \qquad (6)$$

Hence, highly remunerated industries (industries $j$ with high
total revenue $\pi_j q_j$) are those that receive substantial inputs
from highly remunerated industries, a circularity that closely
resembles the PageRank thesis [24]. With the same argument
used in [3] for the Pinski and Narin bibliometric model
we can show that the spectral radius of matrix $A$ is 1, thus
the equilibrium price vector $\pi$ is the dominant eigenvector of
matrix $A$. Such a solution always exists, although it might
not be unique, unless $A$ is irreducible. Notice the striking
similarity of the Leontief closed model with that proposed
by Pinski and Narin. An open Leontief model adds an
exogenous demand and creates a surplus of revenue (profit). It
is described by the equation $\pi = \pi A + v$, where $v$ is the profit
vector. Hubbell himself observes the similarity between his
model and the Leontief open model [9].





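|             | agriculture | industry | family | total | price | revenue |
|-------------|-------------|----------|--------|-------|-------|---------|
| agriculture | 7.5         | 6        | 16.5   | 30    | 20    | 600     |
| industry    | 14          | 6        | 30     | 50    | 15    | 750     |
| family      | 80          | 180      | 40     | 300   | 3     | 900     |
| cost        | 600         | 750      | 900    |       |       |         |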





Table 2: An input-output table for an economy with three sectors with the balance solution. Each row shows 
the output of a sector to other sectors of the economy. Each column shows the inputs received by a sector 
from other sectors. For each sector we also show total quantity produced, equilibrium unitary price, total 
cost, and total revenue. Notice that each sector balances costs and revenues. 
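
The balance in Table 2 can be checked with a few lines of code. The sketch below takes the quantities and prices from the table, builds the coefficient matrix A with a_{i,j} = q_{i,j}/q_j, and verifies that costs equal revenues and that the listed prices satisfy pi = pi A. It also recovers the price ratios by power iteration; since an eigenvector is defined only up to a positive factor, the result is rescaled so that the family price is 3. Function names and the iteration count are illustrative assumptions.

```python
# Quantities from Table 2: Q[i][j] is the output of sector i used by sector j.
SECTORS = ["agriculture", "industry", "family"]
Q = [
    [7.5, 6.0, 16.5],
    [14.0, 6.0, 30.0],
    [80.0, 180.0, 40.0],
]
PRICES = [20.0, 15.0, 3.0]  # equilibrium unit prices listed in Table 2


def totals(quantities):
    """Total quantity produced by each sector (row sums)."""
    return [sum(row) for row in quantities]


def coefficients(quantities):
    """a[i][j] = q[i][j] / q[j]: input of product i per unit of product j."""
    q = totals(quantities)
    n = len(quantities)
    return [[quantities[i][j] / q[j] for j in range(n)] for i in range(n)]


def left_multiply(pi, a):
    n = len(a)
    return [sum(pi[i] * a[i][j] for i in range(n)) for j in range(n)]


def price_direction(a, iters=200):
    """Power iteration pi <- pi A; yields prices up to a positive scale factor."""
    pi = [1.0] * len(a)
    for _ in range(iters):
        pi = left_multiply(pi, a)
    return pi


n = len(Q)
q = totals(Q)
A = coefficients(Q)
cost = [sum(PRICES[i] * Q[i][j] for i in range(n)) for j in range(n)]
revenue = [PRICES[j] * q[j] for j in range(n)]
direction = price_direction(A)
# Fix the free scale factor so that the family price matches the table.
scaled = [x * PRICES[2] / direction[2] for x in direction]

for name, c, r, p in zip(SECTORS, cost, revenue, scaled):
    print(name, c, r, round(p, 6))
```

For each sector the computed cost and revenue coincide (600, 750, and 900), and the rescaled eigenvector reproduces the prices 20, 15, and 3.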



It might seem disputable to juxtapose the PageRank and Leontief
methods. To be sure, the original motivation of Leontief's
work was to give a formal method to find equilibrium prices
for the reproducibility of the economy and to use the method
to estimate the impact on the entire economy of a change
in demand in any sector of the economy. Leontief, to the
best of our limited knowledge, was not motivated by an
industry ranking problem. On the other hand, the motivation
underlying the other methods described in this paper is the
ranking of a set of homogeneous entities. Despite the different
original motivations, however, there are more than coincidental
similarities between the Leontief open and closed models
and the other ranking methods described in this paper.
These connections motivated the discussion of the Leontief
contribution, which is probably the least known among the
surveyed methods within the computing community.

6. CONCLUSION 

The classic notion of quality of information is related to the
judgment given by a few field experts. PageRank introduced
an original notion of quality of information found on the 
Web: the collective intelligence of the Web, formed by the 
opinions of the millions of people that populate this universe, 
is exploited to determine the importance, and ultimately the 
quality, of that information. 

Consider the difference between expert evaluation and collec- 
tive evaluation. The former tends to be intrinsic, subjective, 
deep, slow and expensive. By contrast, the latter is typ- 
ically extrinsic, democratic, superficial, fast and low-cost. 
Interestingly, the dichotomy between these two evaluation 
methodologies is not peculiar to information found on the 
Web. In the context of assessment of academic research,
peer review - the evaluation of scholarly publications given
by peer experts working in the same field as the publication
- plays the role of expert evaluation. Collective evaluation
consists in gauging the importance of a contribution through
the bibliometric practice of counting and analysing citations
received by the publication from the academic community.
Citations generally witness the use of information and
acknowledge intellectual debt. Eigenfactor [33], a PageRank-inspired
bibliometric indicator, is among the most interesting
recent proposals to collectively evaluate the status of
academic journals. The consequences of a shift from peer
review to bibliometric evaluation are currently hotly
debated in the academic community [32].